658 research outputs found
Automatic recognition of fingerspelled words in British Sign Language
We investigate the problem of recognizing words from
video, fingerspelled using the British Sign Language (BSL)
fingerspelling alphabet. This is a challenging task since the
BSL alphabet involves both hands occluding each other, and
contains signs which are ambiguous from the observer’s
viewpoint. The main contributions of our work include:
(i) recognition based on hand shape alone, not requiring
motion cues; (ii) robust visual features for hand shape
recognition; (iii) scalability to large lexicon recognition
with no re-training.
We report results on a dataset of 1,000 low-quality webcam
videos of 100 words. The proposed method achieves a
word recognition accuracy of 98.9%.
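To make the lexicon-scalability claim concrete, here is a minimal, hedged sketch of the general idea rather than the authors' method: classify hand shape per frame from appearance features, then score candidate words against the per-frame letter probabilities, so adding a word to the lexicon requires no re-training. All names and parameters below are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): per-frame hand-shape
# classification with HOG features, then lexicon scoring so new words
# need no re-training.
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

LETTERS = list("abcdefghijklmnopqrstuvwxyz")

def frame_features(gray_patch):
    # HOG descriptor of a fixed-size grayscale hand crop
    # (a robust appearance feature for hand shape).
    return hog(gray_patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# letter_clf: an SVC(probability=True) trained once on labelled hand crops.
def recognize_word(frames, letter_clf, lexicon):
    # P(letter | frame) for every frame; shape (n_frames, 26).
    probs = letter_clf.predict_proba(
        np.stack([frame_features(f) for f in frames]))
    scores = {}
    for word in lexicon:  # score each candidate word; no re-training needed
        # crude alignment: split the frames evenly across the word's letters
        chunks = np.array_split(np.arange(len(frames)), len(word))
        scores[word] = sum(
            np.log(probs[idx, LETTERS.index(ch)].mean() + 1e-9)
            for ch, idx in zip(word, chunks))
    return max(scores, key=scores.get)
```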
"'Who are you?' - Learning person specific classifiers from video"
We investigate the problem of automatically labelling
faces of characters in TV or movie material with their
names, using only weak supervision from automatically aligned
subtitle and script text. Our previous work (Everingham
et al. [8]) demonstrated promising results on the
task, but the coverage of the method (proportion of video
labelled) and generalization was limited by a restriction to
frontal faces and nearest neighbour classification.
In this paper we build on that method, extending the coverage
greatly by the detection and recognition of characters
in profile views. In addition, we make the following contributions:
(i) seamless tracking, integration and recognition
of profile and frontal detections, and (ii) a character-specific
multiple kernel classifier which is able to learn the features
best able to discriminate between the characters.
We report results on seven episodes of the TV series
“Buffy the Vampire Slayer”, demonstrating significantly increased
coverage and performance with respect to previous
methods on this material.
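As a rough illustration of what a character-specific multiple-kernel classifier can look like, here is a hedged toy version, not the paper's formulation: several precomputed feature kernels are mixed with per-character weights chosen by validation accuracy, standing in for a proper MKL solver.

```python
# Toy multiple-kernel sketch (an assumption-laden stand-in, not the
# paper's optimizer): mix precomputed kernels with learned weights.
import numpy as np
from itertools import product
from sklearn.svm import SVC

def combined(kernels, w):
    # weighted sum of precomputed Gram matrices
    return sum(wi * K for wi, K in zip(w, kernels))

def fit_character_mkl(train_kernels, y_train, val_kernels, y_val,
                      grid=(0.0, 0.5, 1.0)):
    # train_kernels: list of (n_train, n_train) Gram matrices
    # val_kernels: matching (n_val, n_train) kernels vs. the training set
    best = None
    for w in product(grid, repeat=len(train_kernels)):
        if sum(w) == 0:
            continue
        clf = SVC(kernel="precomputed").fit(combined(train_kernels, w),
                                            y_train)
        acc = clf.score(combined(val_kernels, w), y_val)
        if best is None or acc > best[0]:
            best = (acc, w, clf)
    return best  # (validation accuracy, kernel weights, classifier)
```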
Taking the bite out of automated naming of characters in TV video
We investigate the problem of automatically labelling appearances of characters in TV or film material
with their names. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time-stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying when characters are speaking. In addition, we incorporate complementary cues of face matching and clothing matching to propose common annotations for face tracks, and consider choices of classifier which can potentially correct errors made in the automatic extraction of training data from the weak textual annotation. Results are presented on episodes of the TV series “Buffy the Vampire Slayer”.
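The subtitle/transcript alignment idea lends itself to a short illustration. The hedged sketch below, using assumed data structures, aligns the two word streams so subtitle timestamps can be transferred to transcript speaker names; the actual paper uses a more careful alignment.

```python
# Hedged sketch: subtitles have timestamps but no speakers; transcripts
# have speakers but no times. Aligning the word streams transfers
# timestamps to names. Data structures are illustrative assumptions.
from difflib import SequenceMatcher

def align_names(subtitles, transcript):
    # subtitles: list of (start_sec, end_sec, text)
    # transcript: list of (speaker, text)
    sub_words, sub_times = [], []
    for start, end, text in subtitles:
        for w in text.lower().split():
            sub_words.append(w)
            sub_times.append((start, end))
    tr_words, tr_speakers = [], []
    for speaker, text in transcript:
        for w in text.lower().split():
            tr_words.append(w)
            tr_speakers.append(speaker)
    labelled = []  # (start, end, speaker) for aligned stretches
    sm = SequenceMatcher(a=sub_words, b=tr_words, autojunk=False)
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            start, end = sub_times[block.a + k]
            labelled.append((start, end, tr_speakers[block.b + k]))
    return labelled
```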
Convolutional Networks for Object Category and 3D Pose Estimation from 2D Images
Current CNN-based algorithms for recovering the 3D pose of an object in an
image assume knowledge about both the object category and its 2D localization
in the image. In this paper, we relax one of these constraints and propose to
solve the task of joint object category and 3D pose estimation from an image
assuming known 2D localization. We design a new architecture for this task
composed of a feature network that is shared between subtasks, an object
categorization network built on top of the feature network, and a collection of
category-dependent pose regression networks. We also introduce suitable loss
functions and a training method for the new architecture. Experiments on the
challenging PASCAL3D+ dataset show state-of-the-art performance in the joint
categorization and pose estimation task. Moreover, our performance on the joint
task is comparable to that of state-of-the-art methods on the simpler task of
3D pose estimation with known object category.
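A hedged PyTorch sketch of the described three-part architecture follows; the layer sizes and the 4-dimensional pose output (e.g. a quaternion) are assumptions for illustration, not the paper's exact design.

```python
# Sketch of a shared feature network, a categorization head, and one
# pose-regression head per category (dimensions are assumptions).
import torch
import torch.nn as nn

class CategoryAndPoseNet(nn.Module):
    def __init__(self, num_categories, feat_dim=512, pose_dim=4):
        super().__init__()
        self.features = nn.Sequential(          # shared feature network
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim), nn.ReLU())
        self.category = nn.Linear(feat_dim, num_categories)
        self.pose_heads = nn.ModuleList(        # category-dependent regressors
            [nn.Linear(feat_dim, pose_dim) for _ in range(num_categories)])

    def forward(self, x):
        f = self.features(x)
        logits = self.category(f)
        # all per-category pose predictions: (batch, num_categories, pose_dim)
        poses = torch.stack([h(f) for h in self.pose_heads], dim=1)
        # at test time, take the pose from the most likely category's head
        idx = logits.argmax(dim=1)
        chosen = poses[torch.arange(x.size(0)), idx]
        return logits, poses, chosen
```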
Two types of S phase precipitates in Al-Cu-Mg alloys
Transmission electron microscopy (TEM) and differential scanning calorimetry (DSC) have been used to study S phase precipitation in an Al-4.2Cu-1.5Mg-0.6Mn-0.5Si (AA2024) and an Al-4.2Cu-1.5Mg-0.6Mn-0.08Si (AA2324) (wt-%) alloy. In DSC experiments on as-solution-treated samples, two distinct exothermic peaks are observed in the range 250 to 350°C, whereas only one peak is observed in solution treated and subsequently stretched or cold worked samples. Samples heated to 270°C and 400°C at a rate of 10°C/min in the DSC have been studied by TEM. The selected area diffraction patterns show that S phase precipitates with the classic orientation relationship form during the lower-temperature peak, and, for the solution treated samples, that the higher-temperature peak is caused by the formation of a second type of S phase precipitate whose orientation relationship is rotated by ~4° relative to the classic one. The effects of Si and cold work on the formation of the second type of S precipitate are discussed.
Impact of adversarial examples on deep learning models for biomedical image segmentation
Deep learning models, which are increasingly being used in the field of medical image analysis, come with a major security risk, namely, their vulnerability to adversarial examples. Adversarial examples are carefully crafted samples that force machine learning models to make mistakes at test time. These malicious samples have been shown to be highly effective in misguiding classification tasks. However, research on the influence of adversarial examples on segmentation is significantly lacking. Given that a large portion of medical imaging problems are effectively segmentation problems, we analyze the impact of adversarial examples on deep learning-based image segmentation models. Specifically, we expose the vulnerability of these models to adversarial examples by proposing the Adaptive Segmentation Mask Attack (ASMA). This novel algorithm makes it possible to craft targeted adversarial examples that come with (1) high intersection-over-union rates between the target adversarial mask and the prediction, and (2) perturbation that is, for the most part, invisible to the naked eye. We lay out experimental and visual evidence by showing results obtained for the ISIC skin lesion segmentation challenge and the problem of glaucoma optic disc segmentation. An implementation of this algorithm and additional examples can be found at https://github.com/utkuozbulak/adaptive-segmentation-mask-attack
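For intuition only, here is a simplified, hedged sketch of a targeted segmentation attack in the spirit of the above; the real ASMA algorithm (see the linked repository) uses an adaptive mask and different optimization details, and the step sizes here are assumptions.

```python
# Simplified targeted attack: iteratively nudge the input so the model
# predicts a chosen target mask, keeping the perturbation small.
import torch
import torch.nn.functional as F

def targeted_mask_attack(model, image, target_mask, steps=100,
                         step_size=1e-3):
    # image: (1, C, H, W) in [0, 1]; target_mask: (1, H, W) class indices
    adv = image.clone().detach().requires_grad_(True)
    for _ in range(steps):
        logits = model(adv)                      # (1, num_classes, H, W)
        loss = F.cross_entropy(logits, target_mask)
        loss.backward()
        with torch.no_grad():
            adv -= step_size * adv.grad.sign()   # move toward target mask
            adv.clamp_(0, 1)                     # stay a valid image
        adv.grad.zero_()
    return adv.detach()
```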
A Comparison and Strategy of Semantic Segmentation on Remote Sensing Images
In recent years, with the development of aerospace technology, we use more
and more images captured by satellites to obtain information. However, a
large number of useless raw images, limited on-board data storage, and poor
transmission capability hinder our use of valuable images. Therefore, it is
necessary to deploy an on-orbit semantic segmentation model to filter out
useless images before data transmission. In this paper, we present a
detailed comparison of recent deep learning models. Considering the
computing environment of satellites, we compare methods in terms of accuracy,
parameter count, and resource consumption on the same public dataset, and
analyze the relations among these factors. Based on the experimental results,
we further propose a viable on-orbit semantic segmentation strategy. It will
be deployed on the TianZhi-2 satellite, which supports deep learning methods
and will be launched soon.
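The kind of comparison described, accuracy against parameter count and resource consumption, can be illustrated with a small hedged profiling helper; the input size, run count, and timing method below are placeholder assumptions rather than the paper's protocol.

```python
# Rough profiling sketch: parameter count and average CPU inference
# latency for a candidate segmentation model under a compute budget.
import time
import torch

def profile(model, input_shape=(1, 3, 512, 512), runs=10):
    params = sum(p.numel() for p in model.parameters())
    x = torch.randn(*input_shape)
    model.eval()
    with torch.no_grad():
        model(x)                                  # warm-up pass
        t0 = time.perf_counter()
        for _ in range(runs):
            model(x)
        ms = (time.perf_counter() - t0) / runs * 1000
    return {"params_M": params / 1e6, "latency_ms": ms}
```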
Learning Dilation Factors for Semantic Segmentation of Street Scenes
Contextual information is crucial for semantic segmentation. However, finding
the optimal trade-off between keeping desired fine details and, at the same
time, providing sufficiently large receptive fields is non-trivial. This is
even more so when objects or classes present in an image vary significantly
in size. Dilated convolutions have proven valuable for semantic segmentation
because they make it possible to increase the size of the receptive field
without sacrificing image resolution. However, in current state-of-the-art
methods, dilation parameters are hand-tuned and fixed. In this paper, we
present an approach for learning dilation parameters adaptively per channel,
consistently improving semantic segmentation results on street-scene datasets
such as Cityscapes and CamVid.
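As a runnable, hedged simplification of the idea (the paper learns continuous dilation factors, which requires interpolation), the sketch below instead lets each output channel learn a soft mixture over a few fixed integer dilation rates, which conveys the per-channel-dilation notion in a simple form.

```python
# Simplified per-channel dilation: each output channel learns a softmax
# mixture over fixed dilation rates (a stand-in for continuous dilation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedDilationConv(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 2, 4)):
        super().__init__()
        self.rates = rates
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.05)
        # one mixing logit per (output channel, dilation rate)
        self.alpha = nn.Parameter(torch.zeros(out_ch, len(rates)))

    def forward(self, x):
        mix = F.softmax(self.alpha, dim=1)          # (out_ch, n_rates)
        # padding == dilation keeps the spatial size for a 3x3 kernel
        outs = [F.conv2d(x, self.weight, padding=r, dilation=r)
                for r in self.rates]
        return sum(mix[:, i].view(1, -1, 1, 1) * o
                   for i, o in enumerate(outs))
```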
What is Holding Back Convnets for Detection?
Convolutional neural networks have recently shown excellent results in
general object detection and many other tasks. Albeit very effective, they
involve many user-defined design choices. In this paper, we want to better
understand these choices by inspecting two key questions: "what did the
network learn?" and "what can the network learn?". We exploit new annotations
(Pascal3D+) to enable a new empirical analysis of the R-CNN detector.
Contrary to common belief, our results indicate that existing
state-of-the-art convnet architectures are not invariant to various
appearance factors. In fact, all considered networks have similar weak
points, which cannot be mitigated by simply increasing the training data
(architectural changes are needed). We show that overall performance can
improve when using image renderings for data augmentation. We report the
best known results on the Pascal3D+ detection and viewpoint estimation tasks.
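The rendering-based augmentation finding can be illustrated with a short hedged sketch: mix a sampled subset of synthetic renderings into the real training set. The dataset objects and the mixing ratio below are assumptions for illustration.

```python
# Mix synthetic renderings into a real training set at a target fraction.
import random
from torch.utils.data import ConcatDataset, Subset

def mix_real_and_rendered(real_ds, rendered_ds, rendered_fraction=0.3):
    # solve n / (len(real_ds) + n) == rendered_fraction for n
    n = int(len(real_ds) * rendered_fraction / (1 - rendered_fraction))
    n = min(n, len(rendered_ds))
    idx = random.sample(range(len(rendered_ds)), n)
    return ConcatDataset([real_ds, Subset(rendered_ds, idx)])
```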
LoANs: Weakly Supervised Object Detection with Localizer Assessor Networks
Recently, deep neural networks have achieved remarkable performance on the
task of object detection and recognition. The reason for this success is mainly
grounded in the availability of large scale, fully annotated datasets, but the
creation of such a dataset is a complicated and costly task. In this paper, we
propose a novel method for weakly supervised object detection that simplifies
the process of gathering data for training an object detector. We train an
ensemble of two models that work together in a student-teacher fashion. Our
student (localizer) is a model that learns to localize an object, the teacher
(assessor) assesses the quality of the localization and provides feedback to
the student. The student uses this feedback to learn how to localize objects
and is thus entirely supervised by the teacher, as we are using no labels for
training the localizer. In our experiments, we show that our model is very
robust to noise and reaches competitive performance compared to a
state-of-the-art fully supervised approach. We also show how simple it is to
create a new dataset based on a few videos (e.g. downloaded from YouTube) and
artificially generated data. To appear in AMV18; code, datasets and models
are available at https://github.com/Bartzi/loan
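A hedged toy sketch of the student-teacher (localizer/assessor) training signal follows; the networks, input sizes, and box handling are placeholder assumptions, not the authors' implementation, and the assessor is assumed to be pre-trained to score localization quality.

```python
# Toy localizer/assessor loop: the assessor scores a predicted box, and
# the localizer is trained to maximize that score, so no localization
# labels are needed for the localizer.
import torch
import torch.nn as nn

localizer = nn.Sequential(nn.Flatten(),
                          nn.Linear(3 * 64 * 64, 4), nn.Sigmoid())
assessor = nn.Sequential(nn.Flatten(),
                         nn.Linear(4 + 3 * 64 * 64, 1), nn.Sigmoid())

# only the localizer's parameters are updated; the assessor stays fixed
opt = torch.optim.Adam(localizer.parameters(), lr=1e-4)

def train_step(images):
    # images: (B, 3, 64, 64); boxes: predicted (x, y, w, h) in [0, 1]
    boxes = localizer(images)
    quality = assessor(torch.cat([boxes, images.flatten(1)], dim=1))
    loss = (1 - quality).mean()     # student follows the teacher's score
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```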